Abstract

We consider the closely related problems of bandit convex optimization with two-point feedback, and 
zero-order stochastic convex optimization with two function evaluations per round. We provide a simple 
algorithm and analysis which is optimal for convex Lipschitz functions. This improves on [?], which only
provides an optimal result for smooth functions. Moreover, the algorithm and analysis are simpler, and
readily extend to non-Euclidean problems. The algorithm is based on a small but surprisingly powerful 
modification of the gradient estimator. 


1 Introduction 

We consider the problem of bandit convex optimization with two-point feedback [1]. This problem can be
defined as a repeated game between a learner and an adversary as follows: At each round $t$, the adversary
picks a convex function $f_t$ on $\mathbb{R}^d$, which is not revealed to the learner. The learner then chooses a point $w_t$
from some known and closed convex set $\mathcal{W} \subseteq \mathbb{R}^d$, and suffers a loss $f_t(w_t)$. As feedback, the learner may
choose two points $w'_t, w''_t \in \mathcal{W}$ and receive $f_t(w'_t), f_t(w''_t)$.¹ The learner's goal is to minimize average
regret, defined as

$$\frac{1}{T}\sum_{t=1}^{T} f_t(w_t) - \min_{w \in \mathcal{W}} \frac{1}{T}\sum_{t=1}^{T} f_t(w).$$

In this note, we focus on obtaining bounds on the expected average regret (with respect to the learner’s 
randomness). 

A closely-related and easier setting is zero-order stochastic convex optimization. In this setting, our
goal is to approximately solve $\min_{w \in \mathcal{W}} F(w)$, where $F(w) = \mathbb{E}_{\xi}[f(w;\xi)]$, given limited access to
$\{f(\cdot;\xi_t)\}_{t=1}^{T}$, where $\xi_t$ are i.i.d. instantiations. Specifically, we assume that each $f(\cdot;\xi_t)$ is not
directly observed, but rather can be queried at two points. This models situations where computing gradients
directly is complicated or infeasible. It is well-known [?] that given an algorithm with expected average regret
$R_T$ in the bandit optimization setting above, if we feed it with the functions $f_t(w) = f(w;\xi_t)$, then the average
$\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$ of the points generated satisfies the following bound on the expected optimization error:

$$\mathbb{E}[F(\bar{w}_T)] - \min_{w \in \mathcal{W}} F(w) \le R_T.$$

¹ This is slightly different than the model of [1], where the learner only chooses $w'_t, w''_t$ and the loss is $\frac{1}{2}(f_t(w'_t) + f_t(w''_t))$.
However, our results and analysis can be easily translated to their setting, and the model we discuss translates more directly to the
zero-order stochastic optimization considered later.

Thus, an algorithm for bandit optimization can be converted to an algorithm for zero-order stochastic
optimization with similar guarantees.

The bandit optimization setting with two-point feedback was proposed and studied in [1]. Independently,
[8] considered two-point methods for stochastic optimization. Both papers are based on randomized
gradient estimates which are then fed into standard first-order algorithms (e.g. gradient descent, or more
generally mirror descent). However, the regret/error guarantees in both papers were suboptimal in terms of
the dependence on the dimension. Recently, [?] considered a similar approach for the stochastic optimization
setting, attaining an optimal error guarantee when $f(\cdot;\xi)$ is a smooth function (differentiable and with
Lipschitz-continuous gradients). Related results in the smooth case were also obtained by [6]. However, to
tackle the general case, where $f(\cdot;\xi)$ may be non-smooth, [?] resorted to a non-trivial smoothing scheme
and a significantly more involved analysis. The resulting bounds have additional factors (logarithmic in
the dimension) compared to the guarantees in the smooth case. Moreover, an analysis is only provided for
Euclidean problems (where the domain $\mathcal{W}$ and Lipschitz parameter of $f_t$ scale with the $L_2$ norm).

In this note, we present and analyze a simple algorithm with the following properties: 

• For Euclidean problems, it is optimal up to constants for both smooth and non-smooth functions. This 
closes the gap between the smooth and non-smooth Euclidean problems in this setting. 

• The algorithm and analysis are readily applicable to non-Euclidean problems. We give an example
for the 1-norm, with the resulting bound optimal up to a $\sqrt{\log(d)}$ factor.

• The algorithm and analysis are simpler than those proposed in [?]. They apply equally to the bandit
and zero-order optimization setting, and can be readily extended using standard techniques (e.g.
to strongly-convex functions, regret/error bounds holding with high-probability rather than just in
expectation, and improved bounds if allowed $k > 2$ observations per round instead of just two).

Like previous algorithms, our algorithm is based on a random gradient estimator, which given a function
$f$ and point $w$, queries $f$ at two random locations close to $w$, and computes a random vector whose
expectation is a gradient of a smoothed version of $f$. The papers [?], [?], [6] essentially use the estimator which
queries at $w$ and $w + \delta u$ (where $u$ is a random unit vector and $\delta > 0$ is a small parameter), and returns

$$\frac{d}{\delta}\left(f(w + \delta u) - f(w)\right)u. \qquad (1)$$

The intuition is readily seen in the one-dimensional ($d = 1$) case, where the expectation of this expression
equals

$$\frac{1}{2\delta}\left(f(w + \delta) - f(w - \delta)\right), \qquad (2)$$

which indeed approximates the derivative of $f$ (assuming $f$ is differentiable) at $w$, if $\delta$ is small enough.

In contrast, our algorithm uses a slightly different estimator (also used in [?]), which queries at $w - \delta u$
and $w + \delta u$, and returns

$$\frac{d}{2\delta}\left(f(w + \delta u) - f(w - \delta u)\right)u. \qquad (3)$$

Again, the intuition is readily seen in the case $d = 1$, where the expectation of this expression also equals
Eq. (2).

When $\delta$ is sufficiently small and $f$ is differentiable at $w$, both estimators compute a good approximation
of the true gradient $\nabla f(w)$. However, when $f$ is not differentiable, the variance of the estimator in Eq. (1)
can be quadratic in the dimension $d$, as pointed out by [?]: For example, for $f(w) = \|w\|_2$ and $w = 0$, the
second moment equals

$$\mathbb{E}\left[\left\|\frac{d}{\delta}\left(f(\delta u) - f(0)\right)u\right\|^2\right] = \mathbb{E}\left[d^2\|u\|^2\right] = d^2.$$


Since the performance of the algorithm crucially depends on the second moment of the gradient estimate,
this leads to a highly sub-optimal guarantee. In [?], this was handled by adding an additional random
perturbation and using a more involved analysis. Surprisingly, it turns out that the slightly different estimator
in Eq. (3) does not suffer from this problem, and its second moment is essentially linear in the dimension $d$.
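The contrast between the two estimators can be illustrated numerically. The following is a minimal Monte Carlo sketch, not part of the original analysis. It assumes the test function $f(w) = \|w\|_2$ and uses hypothetical helper names (`one_sided`, `two_sided`, `second_moment`). Each estimator is a scalar multiple of the unit vector $u$, so its squared Euclidean norm is the square of that scalar.

```python
import math
import random

def sample_unit_sphere(d, rng):
    """Draw u uniformly from the Euclidean unit sphere (normalized Gaussian)."""
    n = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in n))
    return [x / norm for x in n]

def euclid_norm(w):
    """Test function f(w) = ||w||_2 (an assumption made for this illustration)."""
    return math.sqrt(sum(x * x for x in w))

def one_sided(f, w, u, delta, d):
    # Scalar factor of the one-sided estimator: (d / delta) (f(w + delta u) - f(w))
    shifted = [wi + delta * ui for wi, ui in zip(w, u)]
    return d / delta * (f(shifted) - f(w))

def two_sided(f, w, u, delta, d):
    # Scalar factor of the two-sided estimator:
    # (d / (2 delta)) (f(w + delta u) - f(w - delta u))
    plus = [wi + delta * ui for wi, ui in zip(w, u)]
    minus = [wi - delta * ui for wi, ui in zip(w, u)]
    return d / (2.0 * delta) * (f(plus) - f(minus))

def second_moment(estimator, f, w, delta, d, samples, rng):
    """Monte Carlo estimate of E[||g||_2^2], where g = scale * u and ||u||_2 = 1."""
    total = 0.0
    for _ in range(samples):
        u = sample_unit_sphere(d, rng)
        scale = estimator(f, w, u, delta, d)
        total += scale * scale
    return total / samples
```

At the non-differentiable point $w = 0$ the one-sided estimator has second moment $d^2$, while the two-sided one vanishes there because $f(\delta u) = f(-\delta u)$. At a differentiable point such as $w = e_1$, the two-sided second moment is close to $d\,\|\nabla f(w)\|_2^2 = d$ for small $\delta$.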


2 Algorithm and Main Results 


We consider the algorithm described in Algorithm 1, which performs standard mirror descent using a randomized
gradient estimator $g_t$ of a (smoothed) version of $f_t$ at point $w_t$. We make the assumption that one can
indeed query $f_t$ at any point $w_t \pm \delta_t u_t$ as specified in the algorithm.²


Algorithm 1 Two-Point Bandit Convex Optimization Algorithm

Input: Step size $\eta$, function $r : \mathcal{W} \to \mathbb{R}$, exploration parameters $\delta_t > 0$

Initialize $\theta_1 = 0$.

for $t = 1, \ldots, T-1$ do

    Predict $w_t = \arg\max_{w \in \mathcal{W}} \langle \theta_t, w \rangle - r(w)$

    Sample $u_t$ uniformly from the Euclidean unit sphere $\{w : \|w\|_2 = 1\}$
    Query $f_t(w_t + \delta_t u_t)$ and $f_t(w_t - \delta_t u_t)$

    Set $g_t = \frac{d}{2\delta_t}\left(f_t(w_t + \delta_t u_t) - f_t(w_t - \delta_t u_t)\right)u_t$
    Update $\theta_{t+1} = \theta_t - \eta g_t$

end for
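The steps of Algorithm 1 can be sketched in code for one concrete case. This is a minimal sketch under stated assumptions, not a reference implementation: it takes $r(w) = \frac{1}{2}\|w\|_2^2$ and $\mathcal{W}$ a Euclidean ball of radius `radius`, so the Predict step becomes a Euclidean projection of $\theta_t$. It also uses a single fixed loss and a constant exploration parameter `delta`, runs a caller-chosen number of rounds, and all function names are hypothetical.

```python
import math
import random

def sample_unit_sphere(d, rng):
    """Draw u uniformly from the Euclidean unit sphere (normalized Gaussian)."""
    n = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in n))
    return [x / norm for x in n]

def project_to_ball(theta, radius):
    """Predict step for r(w) = ||w||^2 / 2 on a ball: Euclidean projection of theta."""
    norm = math.sqrt(sum(x * x for x in theta))
    if norm <= radius:
        return list(theta)
    return [radius * x / norm for x in theta]

def two_point_ogd(loss, d, rounds, eta, delta, radius, rng):
    """Run the two-point scheme on a fixed loss and return the average iterate."""
    theta = [0.0] * d
    average = [0.0] * d
    for _ in range(rounds):
        w = project_to_ball(theta, radius)
        u = sample_unit_sphere(d, rng)
        # Two function evaluations per round, at w + delta u and w - delta u.
        plus = loss([wi + delta * ui for wi, ui in zip(w, u)])
        minus = loss([wi - delta * ui for wi, ui in zip(w, u)])
        scale = d * (plus - minus) / (2.0 * delta)
        # Gradient step on theta with the estimate g = scale * u.
        theta = [ti - eta * scale * ui for ti, ui in zip(theta, u)]
        average = [ai + wi / rounds for ai, wi in zip(average, w)]
    return average
```

Returning the average iterate matches the stochastic optimization reading of the guarantee, where the error bound is stated for the average of the points $w_t$.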


The analysis of the algorithm is presented in the following theorem: 


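Theorem 1. Assume the following conditions hold:

1. $r$ is 1-strongly convex with respect to a norm $\|\cdot\|$, and $\sup_{w \in \mathcal{W}} r(w) \le R^2$ for some $R < \infty$.

2. $f_t$ is convex and $G_2$-Lipschitz with respect to the 2-norm $\|\cdot\|_2$.

3. The dual norm $\|\cdot\|_*$ of $\|\cdot\|$ is such that $\sqrt[4]{\mathbb{E}_{u_t}[\|u_t\|_*^4]} \le p_*$ for some $p_* < \infty$.

If $\eta = \frac{R}{p_* G_2 \sqrt{dT}}$, and $\delta_t$ is chosen such that $\delta_t \le p_* R \sqrt{d/T}$, then the sequence $w_1, \ldots, w_T$ generated by the
algorithm satisfies the following for any $T$ and $w^* \in \mathcal{W}$:

$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} f_t(w_t) - \frac{1}{T}\sum_{t=1}^{T} f_t(w^*)\right] \le c\, p_* G_2 R \sqrt{\frac{d}{T}},$$

where $c$ is some numerical constant.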

² This may require us to query at a distance $\delta_t$ outside $\mathcal{W}$. If we must query within $\mathcal{W}$, then one can simply run the algorithm on
a slightly smaller set $(1 - \delta)\mathcal{W}$, where $\delta > \delta_t$ for all $t$, ensuring that we always query within $\mathcal{W}$. Since the formal guarantee in Thm. 1
holds for arbitrarily small $\delta_t$, and each $f_t$ is Lipschitz, we can always take $\delta$ and $\delta_t$ small enough so that the additional regret/error
incurred is negligible.

We note that condition 1 is standard in the analysis of the mirror-descent method (see the specific corollaries
below), whereas conditions 2 and 3 are needed to ensure that the variance of our gradient estimator is
controlled.

As mentioned earlier, the bound on the average regret which appears in Thm. 1 immediately implies a
similar bound on the error in a stochastic optimization setting, for the average point $\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$.
We note that the result is robust to the choice of $\eta$, and is the same up to constants as long as $\eta =
\Theta\left(R/(p_* G_2 \sqrt{dT})\right)$. Also, the constant $c$, while always bounded above zero, shrinks as $\delta_t \to 0$ (see the
proof for details).

As a first application, let us consider the case where $\|\cdot\|$ is the Euclidean norm $\|\cdot\|_2$. In this case, we
can take $r(w) = \frac{1}{2}\|w\|_2^2$, and the algorithm reduces to a standard variant of online gradient descent, defined
as $\theta_{t+1} = \theta_t - \eta g_t$ and $w_t = \arg\min_{w \in \mathcal{W}} \|w - \theta_t\|_2$. In this case, we get the following corollary:


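Corollary 1. Suppose $f_t$ for all $t$ is $G_2$-Lipschitz with respect to the Euclidean norm, and $\mathcal{W} \subseteq \{w :
\|w\|_2 \le R\}$. Then using $\|\cdot\| = \|\cdot\|_2$ and $r(w) = \frac{1}{2}\|w\|_2^2$, it holds for some constant $c$ and any $w^* \in \mathcal{W}$
that

$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} f_t(w_t) - \frac{1}{T}\sum_{t=1}^{T} f_t(w^*)\right] \le c\, G_2 R \sqrt{\frac{d}{T}}.$$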



The proof is immediately obtained from Thm. 1, noting that $p_* = 1$ in our case. This bound matches (up
to constants) the lower bound in [?], hence closing the gap between upper and lower bounds in this setting.

As a second application, let us consider the case where $\|\cdot\|$ is the 1-norm, $\|\cdot\|_1$, the domain $\mathcal{W}$ is the
simplex in $\mathbb{R}^d$, $d > 1$ (although our result easily extends to any subset of the 1-norm unit ball), and we use
a standard entropic regularizer:


Corollary 2. Suppose $f_t$ for all $t$ is $G_1$-Lipschitz with respect to the $L_1$ norm. Then using $\|\cdot\| = \|\cdot\|_1$ and
$r(w) = \sum_{i=1}^{d} w_i \log(d w_i)$, it holds for some constant $c$ and any $w^* \in \mathcal{W}$ that

$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} f_t(w_t) - \frac{1}{T}\sum_{t=1}^{T} f_t(w^*)\right] \le c\, G_1 \sqrt{\frac{d \log^2(d)}{T}}.$$

This bound matches (this time up to a logarithmic factor) the lower bound in [?] for this setting.
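With the entropic regularizer of Corollary 2, the Predict step of the algorithm has a closed form. This is a standard fact about entropic mirror descent rather than something derived in the text: maximizing $\langle \theta, w \rangle - \sum_i w_i \log(d w_i)$ over the simplex gives $w_i \propto \exp(\theta_i)$. A minimal sketch, with hypothetical function names:

```python
import math

def entropic_predict(theta):
    """Maximizer over the simplex of <theta, w> - sum_i w_i log(d w_i): a softmax."""
    shift = max(theta)                      # subtract the max for numerical stability
    weights = [math.exp(t - shift) for t in theta]
    total = sum(weights)
    return [x / total for x in weights]

def objective(theta, w):
    """<theta, w> - r(w) with r(w) = sum_i w_i log(d w_i), using 0 log 0 = 0."""
    d = len(w)
    reg = sum(wi * math.log(d * wi) for wi in w if wi > 0.0)
    return sum(t * wi for t, wi in zip(theta, w)) - reg
```

Comparing `objective` at the softmax point against other points of the simplex is a quick sanity check that the closed form is the maximizer.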


Proof. The function $r$ is 1-strongly convex with respect to the 1-norm (see for instance [9], Example 2.5),
and has value at most $\log(d)$ on the simplex. Also, if $f_t$ is $G_1$-Lipschitz with respect to the 1-norm, then it
must be $\sqrt{d}\,G_1$-Lipschitz with respect to the Euclidean norm. Finally, to satisfy condition 3 in Thm. 1, we
upper bound $\sqrt[4]{\mathbb{E}[\|u_t\|_\infty^4]}$ using the following lemma, whose proof is given in the appendix:

Lemma 1. If $u$ is uniformly distributed on the unit sphere in $\mathbb{R}^d$, $d > 1$, then $\sqrt[4]{\mathbb{E}[\|u\|_\infty^4]} \le c\sqrt{\frac{\log(d)}{d}}$, where
$c$ is a positive numerical constant independent of $d$.

Plugging these observations into Thm. 1 leads to the desired result. □
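The scaling in Lemma 1 can be checked empirically. The sketch below is only an illustration under Monte Carlo error, with a hypothetical function name. It samples $u$ as a normalized Gaussian vector and estimates $\sqrt[4]{\mathbb{E}[\|u\|_\infty^4]}$, which can then be compared against $\sqrt{\log(d)/d}$.

```python
import math
import random

def linf_fourth_moment_root(d, samples, rng):
    """Monte Carlo estimate of (E ||u||_inf^4)^(1/4) for u uniform on the unit sphere."""
    total = 0.0
    for _ in range(samples):
        n = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm2 = math.sqrt(sum(x * x for x in n))
        ratio = max(abs(x) for x in n) / norm2   # ||u||_inf for u = n / ||n||_2
        total += ratio ** 4
    return (total / samples) ** 0.25

def lemma_scale(d):
    """The rate sqrt(log(d) / d) appearing in Lemma 1, without the constant c."""
    return math.sqrt(math.log(d) / d)
```

If the lemma's rate is tight, the ratio `linf_fourth_moment_root(d, ...) / lemma_scale(d)` should stay within a constant range as $d$ grows.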


3 Proof of Theorem 1

As discussed in the introduction, the key to getting improved results compared to previous papers is the use
of a slightly different random gradient estimator, which turns out to have significantly less variance. The
formal proof relies on a few simple lemmas listed below. The key lemma is Lemma 5, which establishes the
improved variance behavior.

Lemma 2. For any $w^* \in \mathcal{W}$, it holds that

$$\sum_{t=1}^{T} \langle g_t, w_t - w^* \rangle \le \frac{1}{\eta} R^2 + \eta \sum_{t=1}^{T} \|g_t\|_*^2.$$

This lemma is the canonical result on the convergence of online mirror descent, and the proof is standard
(see e.g. [?]).


Lemma 3. Define the function

$$\hat{f}_t(w) = \mathbb{E}_{u_t}\left[f_t(w + \delta_t u_t)\right],$$

over $\mathcal{W}$, where $u_t$ is a vector picked uniformly at random from the Euclidean unit sphere. Then the function
is convex, Lipschitz with constant $G_2$, satisfies

$$\sup_{w \in \mathcal{W}} |\hat{f}_t(w) - f_t(w)| \le \delta_t G_2,$$

and is differentiable with the following gradient:

$$\nabla \hat{f}_t(w) = \mathbb{E}_{u_t}\left[\frac{d}{\delta_t} f_t(w + \delta_t u_t)\, u_t\right].$$

Proof. The fact that the function is convex and Lipschitz is immediate from its definition and the assumptions
in the theorem. The inequality follows from $u_t$ being a unit vector and that $f_t$ is assumed to be
$G_2$-Lipschitz with respect to the 2-norm. The differentiability property follows from Lemma 2.1 in [?]. □

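Lemma 4. For any function $g$ which is $L$-Lipschitz with respect to the 2-norm, it holds that if $u$ is uniformly
distributed on the Euclidean unit sphere, then

$$\sqrt{\mathbb{E}\left[\left(g(u) - \mathbb{E}[g(u)]\right)^4\right]} \le c\,\frac{L^2}{d}$$

for some numerical constant $c$.

Proof. A standard result on the concentration of Lipschitz functions on the Euclidean unit sphere implies
that

$$\Pr\left(|g(u) - \mathbb{E}[g(u)]| > t\right) \le 2\exp\left(-c' d t^2 / L^2\right)$$

for some numerical constant $c' > 0$ (see the proof of Proposition 2.10 and Corollary 2.6 in [?]). Therefore,

$$\sqrt{\mathbb{E}\left[\left(g(u) - \mathbb{E}[g(u)]\right)^4\right]} = \sqrt{\int_{0}^{\infty} \Pr\left(\left(g(u) - \mathbb{E}[g(u)]\right)^4 > t\right) dt} \le \sqrt{\int_{0}^{\infty} 2\exp\left(-\frac{c' d \sqrt{t}}{L^2}\right) dt} = \sqrt{\frac{4L^4}{(c'd)^2}},$$

which equals $cL^2/d$ for some numerical constant $c$.

□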
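As a concrete instance of Lemma 4, take the coordinate function $g(u) = u_1$. This choice is an assumption made for illustration: it is 1-Lipschitz and has mean zero on the sphere. The standard moment identity $\mathbb{E}[u_1^4] = \frac{3}{d(d+2)}$ then gives $\sqrt{\mathbb{E}[u_1^4]} \le \sqrt{3}/d$, matching the $cL^2/d$ rate. The sketch below, with hypothetical function names, compares a Monte Carlo estimate with this closed form.

```python
import math
import random

def centered_fourth_moment_root(d, samples, rng):
    """Estimate sqrt(E[(g(u) - E g(u))^4]) for g(u) = u_1, whose mean is zero."""
    total = 0.0
    for _ in range(samples):
        n = [rng.gauss(0.0, 1.0) for _ in range(d)]
        u1 = n[0] / math.sqrt(sum(x * x for x in n))
        total += u1 ** 4
    return math.sqrt(total / samples)

def exact_value(d):
    """Closed form sqrt(E[u_1^4]) = sqrt(3 / (d (d + 2))) on the unit sphere."""
    return math.sqrt(3.0 / (d * (d + 2)))
```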


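Lemma 5. It holds that $\mathbb{E}[g_t \mid w_t] = \nabla \hat{f}_t(w_t)$ (where $\hat{f}_t(\cdot)$ is as defined in Lemma 3), and $\mathbb{E}[\|g_t\|_*^2 \mid w_t] \le
c\, d\, p_*^2 G_2^2$ for some numerical constant $c$.

Proof. For simplicity of notation, we drop the $t$ subscript. Since $u$ has a symmetric distribution around the
origin,

$$\begin{aligned}
\mathbb{E}[g \mid w] &= \mathbb{E}_u\left[\frac{d}{2\delta}\left(f(w + \delta u) - f(w - \delta u)\right)u\right] \\
&= \mathbb{E}_u\left[\frac{d}{2\delta} f(w + \delta u)\, u\right] + \mathbb{E}_u\left[\frac{d}{2\delta} f(w - \delta u)(-u)\right] \\
&= \mathbb{E}_u\left[\frac{d}{2\delta} f(w + \delta u)\, u\right] + \mathbb{E}_u\left[\frac{d}{2\delta} f(w + \delta u)\, u\right] \\
&= \mathbb{E}_u\left[\frac{d}{\delta} f(w + \delta u)\, u\right],
\end{aligned}$$

which equals $\nabla \hat{f}(w)$ by Lemma 3.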

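As to the second part of the lemma, we have the following, where $a$ is an arbitrary parameter and where
we use the elementary inequality $(x - y)^2 \le 2(x^2 + y^2)$:

$$\begin{aligned}
\mathbb{E}[\|g\|_*^2 \mid w] &= \mathbb{E}_u\left[\left\|\frac{d}{2\delta}\left(f(w + \delta u) - f(w - \delta u)\right)u\right\|_*^2\right] \\
&= \frac{d^2}{4\delta^2}\, \mathbb{E}_u\left[\|u\|_*^2 \left(f(w + \delta u) - f(w - \delta u)\right)^2\right] \\
&= \frac{d^2}{4\delta^2}\, \mathbb{E}_u\left[\|u\|_*^2 \left((f(w + \delta u) - a) - (f(w - \delta u) - a)\right)^2\right] \\
&\le \frac{d^2}{2\delta^2}\, \mathbb{E}_u\left[\|u\|_*^2 \left((f(w + \delta u) - a)^2 + (f(w - \delta u) - a)^2\right)\right] \\
&= \frac{d^2}{2\delta^2}\left(\mathbb{E}_u\left[\|u\|_*^2 (f(w + \delta u) - a)^2\right] + \mathbb{E}_u\left[\|u\|_*^2 (f(w - \delta u) - a)^2\right]\right).
\end{aligned}$$

Again using the symmetrical distribution of $u$, this equals

$$\frac{d^2}{2\delta^2}\left(\mathbb{E}_u\left[\|u\|_*^2 (f(w + \delta u) - a)^2\right] + \mathbb{E}_u\left[\|u\|_*^2 (f(w + \delta u) - a)^2\right]\right) = \frac{d^2}{\delta^2}\, \mathbb{E}_u\left[\|u\|_*^2 (f(w + \delta u) - a)^2\right].$$

Applying Cauchy-Schwarz and using the condition $\sqrt[4]{\mathbb{E}_u[\|u\|_*^4]} \le p_*$ stated in the theorem, we get the upper
bound

$$\frac{d^2}{\delta^2}\sqrt{\mathbb{E}_u\left[\|u\|_*^4\right]}\sqrt{\mathbb{E}_u\left[(f(w + \delta u) - a)^4\right]} \le \frac{d^2 p_*^2}{\delta^2}\sqrt{\mathbb{E}_u\left[(f(w + \delta u) - a)^4\right]}.$$

In particular, taking $a = \mathbb{E}_u[f(w + \delta u)]$ and using Lemma 4 (noting that $f(w + \delta u)$ is $G_2\delta$-Lipschitz w.r.t.
$u$ in terms of the 2-norm), this is at most $\frac{d^2 p_*^2}{\delta^2} \cdot c\,\frac{(G_2 \delta)^2}{d} = c\, d\, p_*^2 G_2^2$ as required. □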
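Both claims of Lemma 5 can be checked by simulation in the simplest case. This is a minimal sketch under an assumption chosen for illustration: a linear function $f(w) = \langle a, w \rangle$, for which the estimator reduces to $g = d\,\langle a, u \rangle\, u$, so that $\mathbb{E}[g] = a$ and $\mathbb{E}[\|g\|_2^2] = d\,\|a\|_2^2$ (linear in $d$). The function names are hypothetical.

```python
import math
import random

def two_point_estimate(f, w, delta, rng):
    """One draw of g = (d / (2 delta)) (f(w + delta u) - f(w - delta u)) u."""
    d = len(w)
    n = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in n))
    u = [x / norm for x in n]                # uniform on the Euclidean unit sphere
    plus = f([wi + delta * ui for wi, ui in zip(w, u)])
    minus = f([wi - delta * ui for wi, ui in zip(w, u)])
    scale = d * (plus - minus) / (2.0 * delta)
    return [scale * ui for ui in u]

def empirical_moments(f, w, delta, samples, rng):
    """Return the sample mean of g and the sample mean of ||g||_2^2."""
    d = len(w)
    mean = [0.0] * d
    second = 0.0
    for _ in range(samples):
        g = two_point_estimate(f, w, delta, rng)
        mean = [m + gi / samples for m, gi in zip(mean, g)]
        second += sum(gi * gi for gi in g) / samples
    return mean, second
```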

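We are now ready to prove the theorem. Taking expectations on both sides of the inequality in Lemma 2,
we have

$$\mathbb{E}\left[\sum_{t=1}^{T} \langle g_t, w_t - w^* \rangle\right] \le \frac{1}{\eta} R^2 + \eta \sum_{t=1}^{T} \mathbb{E}\left[\|g_t\|_*^2\right] = \frac{1}{\eta} R^2 + \eta \sum_{t=1}^{T} \mathbb{E}\left[\mathbb{E}\left[\|g_t\|_*^2 \mid w_t\right]\right]. \qquad (4)$$

Using Lemma 5, the right hand side is at most

$$\frac{1}{\eta} R^2 + \eta\, c\, d\, p_*^2 G_2^2\, T.$$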

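The left hand side of Eq. (4), by Lemma 5 and convexity of $\hat{f}_t$, equals

$$\mathbb{E}\left[\sum_{t=1}^{T} \langle \mathbb{E}[g_t \mid w_t], w_t - w^* \rangle\right] = \mathbb{E}\left[\sum_{t=1}^{T} \langle \nabla \hat{f}_t(w_t), w_t - w^* \rangle\right] \ge \mathbb{E}\left[\sum_{t=1}^{T} \left(\hat{f}_t(w_t) - \hat{f}_t(w^*)\right)\right].$$

By Lemma 3, $\hat{f}_t(w^*) \le f_t(w^*) + \delta_t G_2$, and since $f_t$ is convex and $u_t$ has zero mean, Jensen's inequality
gives $\hat{f}_t(w_t) \ge f_t(w_t)$. Hence this is at least

$$\mathbb{E}\left[\sum_{t=1}^{T} \left(f_t(w_t) - f_t(w^*)\right)\right] - G_2 \sum_{t=1}^{T} \delta_t.$$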


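Combining these inequalities and plugging back into Eq. (4), we get

$$\mathbb{E}\left[\sum_{t=1}^{T} \left(f_t(w_t) - f_t(w^*)\right)\right] \le G_2 \sum_{t=1}^{T} \delta_t + \frac{1}{\eta} R^2 + c\, d\, p_*^2 G_2^2\, \eta\, T.$$

Choosing $\eta = R/(p_* G_2 \sqrt{dT})$, and any $\delta_t \le p_* R \sqrt{d/T}$, we get

$$\mathbb{E}\left[\sum_{t=1}^{T} \left(f_t(w_t) - f_t(w^*)\right)\right] \le (c + 2)\, p_* G_2 R \sqrt{dT}.$$

Dividing both sides by $T$, the result follows.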
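The last substitution is plain arithmetic and can be verified numerically. The sketch below uses a hypothetical function name and arbitrary test values. It evaluates the three terms $G_2 \sum_t \delta_t$, $\frac{1}{\eta}R^2$ and $c\,d\,p_*^2 G_2^2\,\eta\,T$ at the prescribed $\eta$, with every $\delta_t$ set to its largest allowed value.

```python
import math

def regret_bound_terms(c, d, T, R, G2, p_star):
    """Evaluate the three terms of the final bound at the prescribed eta and delta_t."""
    eta = R / (p_star * G2 * math.sqrt(d * T))
    delta = p_star * R * math.sqrt(d / T)      # largest exploration parameter allowed
    smoothing = G2 * T * delta                 # G_2 * sum_t delta_t
    regularizer = R ** 2 / eta                 # (1 / eta) R^2
    variance = c * d * p_star ** 2 * G2 ** 2 * eta * T
    return smoothing, regularizer, variance
```

The first two terms each equal $p_* G_2 R \sqrt{dT}$ and the third equals $c$ times that quantity, which is where the factor $(c + 2)$ comes from.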
A Proof of Lemma 1

We note that the distribution of $\|u\|_\infty^4$ is equivalent to that of $\frac{\|n\|_\infty^4}{\|n\|_2^4}$, where $n \sim \mathcal{N}(0, I_d)$ is a standard Gaussian
random vector. Moreover, by a standard concentration bound on the norm of Gaussian random vectors (e.g.
Corollary 2.3 in [?], with $\epsilon = 1/2$):

$$\max\left\{\Pr\left(\|n\|_2 \le \sqrt{d/2}\right),\ \Pr\left(\|n\|_2 \ge \sqrt{2d}\right)\right\} \le \exp\left(-\frac{d}{16}\right).$$

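Finally, for any value of $n$, we always have $\frac{\|n\|_\infty}{\|n\|_2} \le 1$, since the Euclidean norm is always at least as large as the
infinity norm. Combining these observations, and using $1_A$ for the indicator function of the event $A$, we
have

$$\begin{aligned}
\mathbb{E}\left[\|u\|_\infty^4\right] &= \mathbb{E}\left[\frac{\|n\|_\infty^4}{\|n\|_2^4}\right] \\
&= \Pr\left(\|n\|_2 \le \sqrt{d/2}\right) \mathbb{E}\left[\frac{\|n\|_\infty^4}{\|n\|_2^4} \,\middle|\, \|n\|_2 \le \sqrt{d/2}\right] + \Pr\left(\|n\|_2 > \sqrt{d/2}\right) \mathbb{E}\left[\frac{\|n\|_\infty^4}{\|n\|_2^4} \,\middle|\, \|n\|_2 > \sqrt{d/2}\right] \\
&\le \exp\left(-\frac{d}{16}\right) \cdot 1 + \Pr\left(\|n\|_2 > \sqrt{d/2}\right) \mathbb{E}\left[\frac{\|n\|_\infty^4}{(\sqrt{d/2})^4} \,\middle|\, \|n\|_2 > \sqrt{d/2}\right] \\
&= \exp\left(-\frac{d}{16}\right) + \frac{4}{d^2}\, \mathbb{E}\left[\|n\|_\infty^4\, 1_{\|n\|_2 > \sqrt{d/2}}\right] \\
&\le \exp\left(-\frac{d}{16}\right) + \frac{4}{d^2}\, \mathbb{E}\left[\|n\|_\infty^4\right]. \qquad (5)
\end{aligned}$$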


Thus, it remains to upper bound $\mathbb{E}[\|n\|_\infty^4]$ where $n$ is a standard Gaussian random vector. Letting
$n = (n_1, \ldots, n_d)$, and noting that $n_1, \ldots, n_d$ are independent and identically distributed standard Gaussian
random variables, we have for any scalar $z \ge 1$ that

$$\begin{aligned}
\Pr\left(\|n\|_\infty \le z\right) &= \prod_{i=1}^{d} \Pr\left(|n_i| \le z\right) = \left(\Pr(|n_1| \le z)\right)^d \\
&= \left(1 - \Pr(|n_1| > z)\right)^d \overset{(1)}{\ge} 1 - d \Pr(|n_1| > z) \\
&= 1 - 2d \Pr(n_1 > z) \overset{(2)}{\ge} 1 - d \exp(-z^2/2),
\end{aligned}$$

where (1) is Bernoulli's inequality, and (2) is using a standard tail bound for a Gaussian random variable.
In particular, the above implies that

$$\Pr\left(\|n\|_\infty > z\right) \le d \exp(-z^2/2).$$

Therefore, for an arbitrary positive scalar $r \ge 1$,

$$\begin{aligned}
\mathbb{E}\left[\|n\|_\infty^4\right] &= \int_{z=0}^{\infty} \Pr\left(\|n\|_\infty^4 > z\right) dz \\
&\le \int_{z=0}^{r} 1\, dz + \int_{z=r}^{\infty} \Pr\left(\|n\|_\infty > \sqrt[4]{z}\right) dz \\
&\le r + \int_{z=r}^{\infty} d \exp\left(-\frac{\sqrt{z}}{2}\right) dz \\
&= r + 4d\left(2 + \sqrt{r}\right) \exp\left(-\frac{\sqrt{r}}{2}\right).
\end{aligned}$$
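The tail bound $\Pr(\|n\|_\infty > z) \le d \exp(-z^2/2)$ used in this derivation can be compared with the exact probability, which has a closed form through the complementary error function. The sketch below uses hypothetical function names and relies only on the identity $\Pr(|n_1| > z) = \mathrm{erfc}(z/\sqrt{2})$ for a standard Gaussian.

```python
import math

def exact_tail(d, z):
    """Pr(||n||_inf > z) = 1 - (1 - Pr(|n_1| > z))^d for n ~ N(0, I_d)."""
    single = math.erfc(z / math.sqrt(2.0))   # Pr(|n_1| > z) for a standard Gaussian
    return 1.0 - (1.0 - single) ** d

def tail_bound(d, z):
    """The bound d * exp(-z^2 / 2) derived in the proof."""
    return d * math.exp(-z * z / 2.0)
```

Evaluating both functions over a grid of $d$ and $z \ge 1$ shows how much slack the Bernoulli and Gaussian-tail steps introduce.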


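In particular, plugging $r = 4\log^2(d)$ (which is larger than 1, since we assume $d > 1$), we get
$\mathbb{E}[\|n\|_\infty^4] \le 4\left(2 + 2\log(d) + \log^2(d)\right)$. Plugging this back into Eq. (5), we get that

$$\mathbb{E}\left[\|u\|_\infty^4\right] \le \exp\left(-\frac{d}{16}\right) + 16\,\frac{2 + 2\log(d) + \log^2(d)}{d^2},$$

which can be shown to be at most $c'\left(\frac{\log(d)}{d}\right)^2$ for all $d > 1$, where $c' < 150$ is a numerical constant. In
particular, this means that $\sqrt[4]{\mathbb{E}[\|u\|_\infty^4]} \le \sqrt[4]{c'}\sqrt{\frac{\log(d)}{d}}$ as required.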
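The claim that the constant stays below 150 can be checked directly by evaluating the ratio of the bound to $\log^2(d)/d^2$ over a range of dimensions. This is a minimal numerical sketch with hypothetical function names. It only scans finitely many values of $d$, so it illustrates the claim rather than proving it.

```python
import math

def fourth_moment_bound(d):
    """Right-hand side exp(-d/16) + 16 (2 + 2 log d + log^2 d) / d^2 from the proof."""
    log_d = math.log(d)
    return math.exp(-d / 16.0) + 16.0 * (2.0 + 2.0 * log_d + log_d ** 2) / d ** 2

def implied_constant(d):
    """Ratio of the bound to log^2(d) / d^2, for an integer d > 1."""
    return fourth_moment_bound(d) * d ** 2 / math.log(d) ** 2
```

On the scanned range the ratio is largest at $d = 2$ and decreases toward 16 as $d$ grows.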